Concurrent evaluation of multiple filters with runtime substitution of expression parameters

ABSTRACT

The invention introduces a mechanism to overcome the limitations of the existing methods for analyzing large amounts of data. The end-user selects the various filter expression types and any required parameters used to condense The data, and this allows interactive selection of the most interesting ones as well as the ability to vary parameters for those filter expressions requiring them. These selected filter expressions are used to produce filtered views of the data in the cross-tabulations, so that the end-user gains a better grasp of the significant attributes of the data by having less important data either omitted, or aggregated into arbitrary groups. Because all of the permitted filter expressions are pre-computed there are significant improvements in usability and effectiveness. Further, since the pre-computation of these quite complex functions is performed concurrently there is potential for significant savings in processing resources.

TECHNICAL FIELD

[0001] The present invention relates to the field of computing. Moreparticularly, it relates to a method for performing analysis of data.

BACKGROUND OF THE INVENTION

[0002] The field of database management systems is well known andunderstood. The increasing size and usage of databases has lead to newproblems. In possibly large, possibly sparse cross-tabulations computedfrom data contained within a database, it is always a challenge topresent the end-user with that subset of data with The highest semanticcontent obtainable. This is done by omitting or consolidating redundantor unimportant information by deleting or aggregating specified rowsand/or columns in the cross-tabulation so that only the more useful datais shown.

[0003] In the past, various techniques have been used To reduce theamount of information presented to the end-user. These techniquesinclude simple ploys such as removing complete rows or columns of datacontaining only zero data or no data at all. Other more complextechniques have also been used to aggregate data so that the end-user isnot overwhelmed by the sheer quantity of values, many of themsufficiently small that they can be discounted when assessing theoverall picture. An example of this approach is the use of the so-called“Pareto rule”, which, in summary, postulates that in many sets of data,a large proportion of the data values are small and uninteresting, andthus may with advantage be grouped or ‘rolled up’ into a pseudo-categorynamed, typically, “Other”. (This ‘rule’ is named for Vilfredo Pareto, anItalian economist and sociologist of the early 1900s. It is based on theunequal distribution of things in the universe, and paraphrased statesthat “80% of wealth is in the hands of 20% of the population”).

[0004] Other similar functions, herein referred to as filterexpressions, have been adopted to assist the end-user in interpretingdata. Each such function generally requires significant ad hocdevelopment, and frequently not all of them have been available to theend-users of any one system. Because of the ad hoc nature of previousimplementations, there have been few opportunities to tale advantage ofthe processing savings available if several functions are evaluated atonce.

[0005] Typical end-users of large databases are not always familiar withor even aware of, these functions, and in consequence are not able tomanipulate the data effectively.

[0006] What is needed is an improvement in the usability of databasesand in efficiency of processing the data.

SUMMARY OF THE INVENTION

[0007] According to the invention, there is provided a computer-basednethod of evaluating data by selecting the highest semantic content of atable of data, comprising the steps of: constructing a cross-tabulationof data from one or more databases; ascertaining filter expressions tobe applied to said data; evaluating said filter expressions using saiddata; and storing the results of said evaluation in a status table.

[0008] The present invention introduces a mechanism to overcome thelimitations of the existing methods for analyzing large amounts of data,which improves their usability as well as efficiency of processing.Examination of The problem led to the realization that there is indeed arelatively small number of possible filter expressions, or rather,filter expression types, which are useful in the context of manipulatingcross-tabulations, particularly, but by no means limited to, largesparse ones. Typically, the end-user is presented with a list from whichto select the various filter expression types used to condense the data,and this allows interactive selection of the most interesting ones.

[0009] The invention is a computer-based method of evaluating data byselecting the highest semantic content of a table of data. This isachieved, in one embodiment, by constructing a cross-tabulation of datafrom one or more databases, then ascertaining which filter expressionsare to be applied to The data. The results of the evaluations of thefiltering expressions using the data are then stored in a status tablefor later use.

[0010] In addition to permitting the end-user to select the filterexpressions to be applied to the cross-tabulations, the invention allowsthe end-user to choose parameters or arguments for those filterexpressions requiring them. All of the selected filter expressions canthen be used as ‘filters’ of the data in the cross-tabulations, so thatthe end-user gains a better grasp of the significant attributes of thedata by having less important data either omitted, or aggregated intoarbitrary groups.

[0011] An additional benefit of the invention is that fasterpresentation (or evaluation and display) of the results occurs becauseall of the permitted filter expressions can be pre-computed. This fasterpresentation leads to significant improvements in usability andeffectiveness.

[0012] It can also be seen that in geographically dispersed systems, anyreduction of the quantity of data presented to the end-user provides theadditional benefit of lessening the system resources required totransfer that data between locations.

[0013] In some instances, the invention involves the pre computation ofquite complex functions. Although this can be expensive in processingtime, the improvements in end-user results and presentation speed, aswell as the resultant savings in data transfer volumes, often outweighthis cost. Pre-computation is especially beneficial when it isanticipated that The table will be used multiple times, which is morelikely where the end-user is analyzing the data interactively. Further,when performed concurrently, the computation cost for several parametersused in selected functions does not increase linearly with the number ofparameters, but rather each is a relatively small incremental cost.Overall, concurrent pre-computing of multiple filers has the potentialfor significant savings in processing resources.

[0014] Although the primary benefit of the invention is to improvecomputational efficiency and provide enhanced end-user functionality,there are further benefits for client/server and similar network-basedenvironments. The invention permits, indeed encourages, the end-user tomake decisions which ultimately reduce the amount of data required to betransmitted across the network.

[0015] The environment in which the present invention is used is that ofa general purpose computing facility connected with a number ofdatabases. It is typically used by a number of simultaneous end-users,although that aspect is not relevant to the operation of the invention.The computing facility may comprise a number of interconnectedcomputers, and the databases and users may be co-located or remotelylocated. Interconnection of these elements, whether or not co-located,might be over a network such as the Internet.

[0016] Other aspects of the present invention will become apparent tothose ordinarily skilled in the art upon review of the followingdescription of specific embodiments of the invention in conjunction withthe accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0017]FIG. 1 shows the computing environment typically used inimplementing the preferred embodiment of the invention.

[0018]FIGS. 2, 2a, 2 b and 2 c depict a flowchart of the operation of apreferred embodiment of the invention, from the end-user's perspective,

[0019]FIG. 3 shows in a somewhat simplified manner, the various tablesrequired to implement a preferred embodiment of the invention.

[0020]FIG. 4 shows how the filter status data can be used to producemany filtered views from a base cross-tabulation for which multiplefilters have been evaluated.

DESCRIPTION OF PREFERRED EMBODIMENTS

[0021] Referring to FIG. 1, the invention is implemented as softwarerunning on a general purpose computing facility 10 which has access to anumber of databases 11 and is typically used by a number of end-users12.

[0022] Filter Expressions:

[0023] A filter expression is any logical predicate with a singleunbound variable, which is evaluated by taking the value of a cellwithin a cross-tabulation as the value of the unbound variable. Thepreferred embodiment is assumed to contain an expression evaluationfacility capable of evaluating arbitrary logical predicates.

[0024] The simplest filter expression is an arbitrary logical predicatein the form of a function with a single real valued argument (the cellvalue) e.g. V_(ij)>1.23, where V_(ij) is The value in the cell at theintersection of the i'th row and the j'th column. The functions may bearbitrarily complex and may involve arithmetic functions such as ABS,MOD or SQUARE_ROOT; comparison operators such as LESS THAN (<) or NOTEQUALS (≠) as well as logical operators such as AND or NOT. The cellvalue V_(ij) which is the argument to the function, may appeararbitrarily many times within the function. A typical filter expressionof This type would determine whether each of the values fell within aspecified range. This is expressed as:

(V_(ij)>1.23 & V_(ij)<4.56)

[0025] A richer repertoire of filter expressions may be obtained if theexpressions may contain functions which take as an argument the entireset of values within the cross-tabulation. For example, the mean(average) value of the cells within a cross-tabulation may berepresented as ƒ(V) where ƒ represents the particular function and Vrepresents the full set of values within the entire cross-tabulation. Anarbitrary predicate thus has the general form, ƒ(V_(ij),ƒ₁, ƒ₂. . . ƒ₃).A typical filter expression of this type would test whether the value ina cell is 2 standard deviations above the mean, This is expressed as;

((V_(ij)—MEAN(V)/(2* SQUARE_ROOT(V))>2)

[0026] Many of the interesting functions of the form ƒ(V) are made moreuseful if they may be parameterized. For example, the percentilefunction which finds the value which is larger (or smaller) than aspecified percentage of the values of set is more useful if thepercentage can be defined. Thus a useful form of the percentile functiontakes two arguments: a set containing the values in the cross tabulationand a number between 1 and 99 indicating the percentage. The full formis thus ƒ(V, c), where c is the control value which controls theevaluation of the function. A typical filter expression of this typewould determine whether each of the values fell between the first andthird quartiles for the values ill the table. This is expressed as:

(V_(ij)>BOTTOM_PERCENTILE(V, 25) & V_(ij)<TOP_PERCENTILE (V, 25))

[0027] The result of evaluating a function of the form or ƒ(V) or ƒ(V,c) is termed a run time parameter of the filter expression.

[0028] Selection of Parameter Types:

[0029] Unlike the basic logical predicate in the form of a function witha single real valued argument the filter expression containing functionsevaluated over the set V cannot be simply expressed because thedefinition of bow to compute the functions ƒ(V) or ƒ(V, c) can be quitecomplex. A practical embodiment must constrain the ƒ(V) or ƒ(V, c) to apredefined set.

[0030] The types of run time parameter values which can be defined forinclusion within simple logic predicates are without theoretical limitbut a practical set would be based on; a) counts; b) sums; c) extremevalues; d) central and non-central moment statistics, including the meanand standard deviation; e) order statistics, including percentiles,deciles, quaniles and the median; f) values or cumulative sums atspecified ordinal positions

[0031] Twelve functions over the set of values in the cross-tabulationwhich have been chosen as being these with the highest utility for thetypes of data under consideration are defined for use here. Six of theseexpressions are simple (or univariate) run time parameters —ƒ(V)— and afurther six are run time parameters with control values provided by theend-user -—ƒ(V, p). They appear in Table 1. TABLE 1 Expression TypeExamples Simple statistical expressions Count Standard Deviation SumMaximum Mean Minimum Statistical expressions Top Count n BottomPercentile n with control values Bottom Count n Top Sum n Top Percentilen Bottom Sum n

[0032] All of these filter expressions are well understood in the art.The following brief definitions are included for reference.

[0033] Top Count n: The n'th largest value in the set V.

[0034] Bottom Count n: The n'th smallest value in the set V.

[0035] Top Percentile n: In a size ordered set of values, the smallestvalue in the set of maximal values which constitute n% of the set V.

[0036] Bottom Percentile n: In a size ordered set of values, the largestvalue in the set of minimal values which constitutes n% of the sum forthe set V.

[0037] Top Sum n: The value in a size ordered set of values which issmaller than the set of maximal values which sum to n% of the sum forthe set V.

[0038] Bottom Sum n: The value in a size ordered set of values which islarger than the set of minimal values which sum to n% of the sum for theset V.

[0039] It will be obvious to an ordinarily skilled practitioner of theart that more filter expressions, both simple (e.g. mode) and withcontrol values (e.g. explicit deciles)₁ could be added to the list inTable 1, but they would have little value in most domains. However,their inclusion or exclusion does not change the nature of theinvention.

[0040] Operation of the Preferred Embodiment:

[0041] The flowchart FIG. 2 and the FIGS. 3 and 4 illustrate a preferredembodiment of the present invention in which an end-user interacts withthe computing system and data. Referring to FIGS. 2 and 3, in the firststep to be considered 100, The end-user may elect to ascertain andconstruct filter expressions from primitives with the help of anexpression editor and, where a parameter is required for the filterexpression, may elect to provide that information as well. In the nextstep 110, the end-user constructs expressions from primitives with thehelp of an expression editor and, where a parameter is required for thefilter expression, provides that information. Subsequently, the end-user120 selects the data to be examined, as well as the preferred filterexpressions to be used in manipulating the data, which are stored in210. The data are retrieved 130 from the appropriate database repositoryand stored in the cross-tabulation 200. The computing system examinesthe filter expressions 140 selected by the end-user 120 and, ifrequired, the parameter values are evaluated and stored 150 in the RunTime Parameter table 230 as previously described. In turn, the selectedfilter expressions are then evaluated 160 for each data value and theresults are stored in the appropriate Expression Status table 220, Inturn the expression status values for each row and column are computedand stored 150 in the appropriate filter expression tables 210. Thefilter set is displayed 170 to the end-user to permit selection of thedesired view. Referring now also to FIG. 4, if the end-user makes aselection 180, the computing system filters the Data so Table 300according to the Expression status information for the rows 310, and the

[0042] Expression status information for the columns 312 and presents190 a filtered cross-tabulation 320 to the end-user. Finally, if nofurther filtered views are requested, the process terminates 199.

[0043] Example Using the Invention.

[0044] The utility of the invention is further explained by way of thefollowing example.

[0045] Consider a situation in which a manager wishes to produce a listof those salespeople who have consistently had gross sales above themean for every quarter of the year (and give Them bonuses!), and alsoanother list of those salespeople who have NEVER been in the top 25% ofgross sales for any quarter during the year (and replace them!).

[0046] First, a temporary (or ephemeral) data table known as a‘cross-tabulation’ is created (from information stored in a databaserepository) in which the ‘rows’ represent individual salespeople, and‘columns’ represent calendar quarters, values in the table being thegross sales by salesperson/quarter.

[0047] For both the ‘rows’ and ‘columns’ there is also created a set of‘filter expressions’, selected or determined by the manager from a listof possible filter expressions. Multiple filter expressions can berequested simultaneously by the manager, if desired The set of filterexpressions may be the same for the rows and columns but need not be.

[0048] The set of rows in the cross-tabulation and The set of filterexpressions for the rows imply a table of ‘expression status values’which has as many rows as the number of rows in the cross-tabulation andas many columns as the number of filter expressions associated with therows. Each of these ‘expression status values’, once computed, has thevalue ‘True’ or ‘False’: ‘False’ meaning that there is at least one cellvalue in the row for which the filter expression evaluates to False;‘True’ meaning that for all cell values in The row, the filterexpression evaluates to True. There is likewise a set of filterexpressions and an expression status value table for columns to whichthe foregoing applies mutatis mutandis.

[0049] In this example, the manager chose two filter expressions, onebased on Mean and the other on Top Percentile (25). The system canconstruct on demand a display table consisting of either.

[0050] Salespeople whose sales in the previous four quarters are abovethe Mean, or

[0051] Salespeople who have not been in the Top 25 Percentile in any ofthe preceding four quarters.

[0052] The system, when requested to display the data as requested bythe manager then creates another (viral) table comprising only thoserows and columns for which the selected ‘expression status values’ are‘True’. This can be achieved by examining the ‘expression status values’for appropriate rows (i.e., for each Salesperson) without incurring thesignificant cost of processing and other resources in recomputing thesestatus values each time.

[0053] The manager can apply, or re-apply, multiple filters with a lowcomputational (and time) cost because the time-consuming operations ofdata retrieval and selector calculation have already been performed.

[0054] The invention can be implemented in digital electronic circuitryor in computer hardware, firmware, software or in combinations thereofApparatus of the present invention can be implemented in a computerprogram product tangibly embodied in a machine-readable storage devicefor execution by a programmable processor. The method of the presentinvention can be performed by a programmable processor executing aprogram of instructions to perform the functions of the invention byoperating on input data and generating output.

[0055] Numerous modifications, variations and adaptations Play be madeto he particular embodiments of the invention described above withoutdeparting from the scope of the invention, which is defined in theclaims.

What is claimed is:
 1. A computer-based method of evaluating data byselecting the highest semantic content of a table of data, comprisingthe steps of: constructing a cross-tabulation of data from one or moredatabases; ascertaining fitter expressions to be applied to said data;evaluating said filter expressions using said data; and storing theresults of said evaluation in a status table
 2. The method of claim 1wherein after the step of storing the results of said evaluation thefollowing steps take place: selecting the particular filter expressionsto be used in viewing said cross-tabulation; and using said status tabletogether with said particular filter expressions to determine the rowsand columns to be included in viewing said cross-tabulation.
 3. Themethod of claim 2 wherein after the ascertaining step the followingsteps take place: accepting arguments from the end-user to be used withsaid filter expressions; and completing said filter expressions withsaid arguments.
 4. The method of claim 3 wherein for any of said filterexpressions which include filter expressions requiring the evaluation ofstatistical information derived from said cross-tabulation the followingsteps take place before the evaluating step: calculating the ran timeparameters necessary to evaluate said filter expressions; and assemblingsaid filter expressions using said run time parameters.
 5. The method ofclaim 4 wherein after the using step the following step Takes place:presenting the end-user with the results of The using step.
 6. Themethod of claim 5 wherein the following steps are performed using anexpression editor: ascertaining the filter expressions to be applied tothe data; and accepting arguments to be used with said filterexpressions.
 7. The method of claim 6 wherein said filter expressionsare assembled using functions selected from the group consisting of a)counts, b) sums, c) extreme values, d) central and non-central momentstatistics including the mean and standard deviation, e) orderstatistics, including percentiles, f) values or cumulative sums atspecified ordinal positions.
 8. The method of claim 7 wherein saidexpression editor is used to construct more complex expressions bycombining said set of filter expressions with logical operations,including AND and OR.
 9. A computer-based method of evaluating data byselecting the highest semantic content of a table of data, comprisingthe steps of constructing a cross-tabulation of data from one or moredatabases; ascertaining with an expression editor filter expressions tobe applied to said data, said filter expressions being selected from thegroup consisting of a) counts, b) sums, c) extreme values, d) centraland non-central moment statistics including the mean and standarddeviation, e) order statistics, including percentiles, f) values orcumulative sums at specified ordinal positions; accepting with saidexpression editor arguments from the end-user to be used with saidfilter expressions; completing said filter expressions with saidarguments; calculating the run time parameters necessary To evaluatesaid filter expressions; assembling said filter expressions using saidrun time parameters; evaluating said filter expressions using said data;storing the results of said evaluation in a status table; selecting theparticular filter expressions, including those more complex expressionsconstructed with said expression editor by combining said set of filterexpressions with logical operations, including AND and OR To be used inviewing said cross-tabulation; using said status table together withsaid particular filter expressions to determine The rows and columns tobe included in viewing said cross-tabulation; and presenting teeend-user with the results of the using step.
 10. The method of claim 9wherein the selecting, using and presenting steps are repeated asrequired to give the end-user different views of said cross-tabulation.11. A computer-readable medium having executable code stored thereon,the code for evaluating data by selecting the highest semantic contentof a table of data, the code comprising; code to construct across-tabulation of data from one or more databases; code to ascertainfilter expressions to be applied to said data; code to evaluate saidfilter expressions using said data; and code to store the results ofsaid evaluation in a status table.
 12. A computer-readable medium havingexecutable code stored thereon, the code for evaluating data byselecting the highest semantic content of a table of data, the codecomprising: code to construct a cross-tabulation of data from one ormore databases; code to ascertain wide an expression editor filterexpressions to be applied To said data, said filter expressions beingselected from the group consisting of a) counts, b) sums, c) extremevalues, d) central and non-central moment statistics including the meanand standard deviation, e) order statistics, including percentiles, f)values or cumulative sums at specified ordinal positions; code to acceptwith said expression editor arguments from the end-user to be used withsaid filter expressions; code to complete said filter expressions withsaid arguments; code to calculate the run time parameters necessary toevaluate said filter expressions; code to assemble said filterexpressions using said run time parameters; code to evaluate said filterexpressions using said data; code to store the results of saidevaluation in a status Table; code to select the particular filterexpressions, including Those more complex expressions constructed withsaid expression editor by combining said set of filter expressions withlogical operations, including AND mid OR to be used in viewing saidcross-tabulation; code to use said status table together with saidparticular filter expressions to determine the rows and columns to beincluded in viewing said cross-tabulation; and code to present theend-user with the results of the use code,
 13. A computer-basedapparatus which evaluates data by selecting the highest semantic contentof a table of data, comprising the following: means for constructing across-tabulation of data from one or more databases; means forascertaining filter expressions to be applied to said data; means forevaluating said filter expressions using said data; and means forstoring the results of said evaluation in a status table.
 14. Acomputer-based apparatus which evaluates data by selecting the highestsemantic content of a table of data, comprising the following: means forconstructing a cross-tabulation of data from one or more databases;means for ascertaining with an expression editor filter expressions tobe applied to said data, said filter expressions being selected from thegroup consisting of a) counts, b) sums, c) extreme values, d) centraland non-central moment statistics including the mean and standarddeviation, e) order statistics, including percentiles, f) values orcumulative sums at specified ordinal positions; means for accepting withsaid expression editor arguments from the endures to be used with saidfilter expressions; means for completing said filter expressions withsaid arguments; means for calculating the run time parameters necessaryto evaluate said filter expressions; means for assembling said filterexpressions using said run time parameters; means for evaluating saidfilter expressions using said data; means for storing the results ofsaid evaluation in a status table; means for selecting the particularfilter expressions, including those more complex expressions constructedwith said expression editor by combining said set of filter expressionswith logical operations, including AND and OR, to be used in viewingsaid cross-tabulation; means for using said status table together withsaid particular filter expressions to determine the rows and columns lobe included in viewing said cross-tabulation; and means for presentingthe end-user with the results of the using step.